Spatiotemporal Data

Background

Industrialization and manufacturing growth in China have been accompanied by major growth in pollution. Key pollutants include fine particulate matter (PM2.5 and PM10), nitrogen dioxide (NO2), sulfur dioxide (SO2), surface ozone (O3), and carbon monoxide (CO). The US EPA’s Air Quality Index (AQI) is one index of air quality based on these six pollutants. To attain a “good” AQI based on PM2.5, the 24-hour average must be less than 12 μg/m^3; a 24-hour average PM2.5 level of at least 55.5 μg/m^3 is considered “unhealthy,” while levels above 250.5 μg/m^3 are considered “hazardous.” The data here come from Beijing, a city notable for its poor air quality.
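The PM2.5 thresholds above can be encoded as a simple lookup. This is a minimal sketch, not the full AQI computation (which interpolates an index value); the intermediate breakpoints (35.4 and 150.4 μg/m^3) are assumed from the standard EPA PM2.5 table and go beyond what the text states:

```r
# Minimal sketch: map a 24-hour average PM2.5 reading to an AQI category.
# Breakpoints 35.4 and 150.4 are assumptions from the standard EPA table.
pm25_category <- function(pm25_24h) {
  cut(pm25_24h,
      breaks = c(-Inf, 12, 35.4, 55.4, 150.4, 250.4, Inf),
      labels = c("Good", "Moderate", "Unhealthy for Sensitive Groups",
                 "Unhealthy", "Very Unhealthy", "Hazardous"))
}

pm25_category(c(10, 60, 300))  # Good, Unhealthy, Hazardous
```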

Data

The file beijing.csv contains 420,768 hourly measurements of six pollutants and weather-related factors measured at 12 monitoring stations around Beijing from March 1, 2013 through February 28, 2017. This file has not been cleaned. The data are as follows, with all pollutants measured in micrograms per cubic meter (μg/m^3).

  • year: year of measurement
  • month: month of measurement
  • day: day of measurement
  • hour: hour of measurement
  • PM2.5: PM2.5 concentration
  • PM10: PM10 concentration
  • SO2: sulfur dioxide concentration
  • NO2: nitrogen dioxide concentration
  • O3: surface ozone concentration
  • CO: carbon monoxide concentration
  • TEMP: temperature in degrees Celsius
  • PRES: barometric pressure in hectopascals
  • RAIN: precipitation in millimeters
  • wd: wind direction; 16 compass directions (N, NNE, NE, ENE, E, etc.)
  • WSPM: wind speed in meters per second
  • station: monitoring site (twelve unique sites spaced around the city)

1. Data Cleaning

a. We have data at an hourly scale, but there are a couple of derived date attributes that we may find useful:

  • Add a column for “day of year” (for example, January 3rd is the third day of the year) using the yday function in the lubridate package.
  • Add a column for “weekday” using your method of choice. Hint: format(x, "%A") may be useful.
  • Use ISOdatetime() to build a new column called as_datetime. Set the minute and second to 0.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
beijing <- read.csv("beijing.csv")

beijing$as_datetime <- ISOdatetime(beijing$year, beijing$month, beijing$day, beijing$hour, 0, 0)
beijing$weekday <- format(beijing$as_datetime, "%A")
beijing$day_of_year <- yday(beijing$as_datetime)

b. Each measurement is associated with a weather station. The location of each station is in stations.csv. Inner join the information from that file into our air quality data.

stations <- read.csv("stations.csv")
beijing <- inner_join(beijing, stations, by=c("station"))

c. This data is currently in a wide format with respect to the pollutants (one column for each pollutant), but we would like to compare trends across pollutants. Ideally we would have figures that summarize ALL our data while splitting up by pollutant. This calls for a long format. Use the pivot_longer function from the tidyverse to make a new data frame with a row for each pollutant, with a column titled pollutant for the name of the pollutant and a column titled concentration for its concentration.

beijing_long <- pivot_longer(beijing, 
                             c("PM2.5", "PM10", "SO2", "NO2", "CO", "O3"), 
                             names_to="pollutant", 
                             values_to="concentration")

2. Temporal patterns

a. Build a single figure that describes the distribution of the pollutants depending on the day of the week. To make your figure more informative, consider the order in which the weekdays are displayed. For this first figure, do not use facet_wrap or facet_grid; instead set the x aesthetic to pollutant.

beijing_long$weekday <- factor(beijing_long$weekday, 
                               levels=c("Monday", "Tuesday", "Wednesday", 
                                        "Thursday", "Friday", "Saturday", 
                                        "Sunday"))
ggplot(beijing_long, aes(pollutant, concentration, fill=weekday)) +
    geom_boxplot()
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).

b. Wow, that figure looks pretty rough, since the pollutant concentrations vary not only across pollutants but also within each pollutant, owing to the huge number of outliers! Make the same figure but use a log scale. What base of logarithm should you use?

ggplot(beijing_long, aes(pollutant, concentration, fill=weekday)) +
    geom_boxplot() +
    scale_y_continuous(trans='log10')
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).

c. Now make the same graph but use facet_wrap or facet_grid. Is this figure or the one above more informative (in your opinion)?

ggplot(beijing_long, aes(y = concentration, fill=weekday)) +
    geom_boxplot() +
    scale_y_continuous(trans='log10') +
    facet_wrap(~pollutant)
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).

d. The distributions are much easier to visualize now without compromising the conclusions we can draw. Does the day of the week seem to make a difference?

e. Make a similar figure with a log scale, but showing the change in pollutants by month. It is your choice if you would like to facet or not. Describe some patterns you see in the data.

ggplot(beijing_long, aes(pollutant, concentration, fill=factor(month))) +
    geom_boxplot() +
    scale_y_continuous(trans='log10') +
    scale_fill_discrete(labels=month.abb)
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).

f. Finally make a similar figure, but on an hourly scale. How could this plot be improved?

ggplot(beijing_long, aes(pollutant, concentration, fill=factor(hour))) +
    geom_boxplot() +
    scale_y_continuous(trans='log10') +
    ylab("Concentration")
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).

3. Wind Direction

a. Beijing residents believe that the worst air quality occurs during easterly winds. Make a data frame of summary statistics describing the relative magnitude of each pollutant by wind direction. If you encounter NAs you may ignore them. Your data frame should allow someone to quickly answer the following questions:

  • For a given pollutant and wind direction, what is the mean concentration?
  • For a given pollutant and wind direction, what is the ratio of the mean concentration to the highest mean concentration for that pollutant? For example: if the mean CO concentration in the east direction is 10, and the mean CO concentration in the west direction is 100 and is also the maximum, the relative magnitude would be 0.1.
  • What is the average wind speed for a given wind direction?
  • What angle corresponds to a given wind direction?
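The ratio in the second bullet can be illustrated with hypothetical numbers (made up purely for illustration):

```r
# Hypothetical mean CO concentrations by wind direction (illustrative only)
co_means <- c(E = 10, W = 100, N = 40)

# Relative magnitude: each direction's mean divided by the largest mean
co_means / max(co_means)
#   E   W   N
# 0.1 1.0 0.4
```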

Hint: You may find the following code chunk helpful.

# https://stackoverflow.com/questions/42597344/convert-from-compass-direcitions-to-degrees-r
wind_direction_map <- setNames(
    seq(0, 337.5 , by=22.5), 
    c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE", 
      "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW")
)

print(wind_direction_map["N"])
## N 
## 0
wind_direction_info <- beijing_long %>%
  drop_na(wd, pollutant) %>%
  group_by(wd, pollutant) %>%
  summarise(
    mean_speed = mean(WSPM),
    mean_value = mean(na.omit(concentration)),
    angle = wind_direction_map[first(wd)]
  )
## `summarise()` has grouped output by 'wd'. You can override using the `.groups` argument.
head(wind_direction_info)
## # A tibble: 6 × 5
## # Groups:   wd [1]
##   wd    pollutant mean_speed mean_value angle
##   <chr> <chr>          <dbl>      <dbl> <dbl>
## 1 E     CO              1.31     1579.     90
## 2 E     NO2             1.31       59.7    90
## 3 E     O3              1.31       47.7    90
## 4 E     PM10            1.31      124.     90
## 5 E     PM2.5           1.31      102.     90
## 6 E     SO2             1.31       18.4    90
max_mean_by_pollutant <- wind_direction_info %>%
  group_by(pollutant) %>%
  summarise(max_mean = max(mean_value))

wind_direction_info <-
  inner_join(wind_direction_info, max_mean_by_pollutant, by = c("pollutant"))

head(wind_direction_info)
## # A tibble: 6 × 6
## # Groups:   wd [1]
##   wd    pollutant mean_speed mean_value angle max_mean
##   <chr> <chr>          <dbl>      <dbl> <dbl>    <dbl>
## 1 E     CO              1.31     1579.     90   1621. 
## 2 E     NO2             1.31       59.7    90     64.1
## 3 E     O3              1.31       47.7    90     95.3
## 4 E     PM10            1.31      124.     90    126. 
## 5 E     PM2.5           1.31      102.     90    102. 
## 6 E     SO2             1.31       18.4    90     18.7

b. Build a single figure using polar coordinates which describes the average relative magnitude for each pollutant when grouped by wind direction. Account for wind speed in some way in your figure.

ggplot(wind_direction_info,
       aes(
         x = angle,
         y = mean_value / max_mean,
         fill = mean_speed
       )) +
  geom_bar(width = 22.5,
           stat = "identity",
           color = "white") +
  scale_fill_gradient(low = "blue", high = "red") +
  coord_polar(start = -pi / 16) +
  theme(
    axis.ticks = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank(),
    axis.line = element_blank(),
    legend.position = "bottom",
    legend.key.width = unit(1.5, "cm")
  ) +
  geom_text(aes(angle, 1.1, label = wd), color = "black", size = 2) +
  ggtitle("Average of Pollutant vs. Direction") +
  guides(fill = guide_colorbar(title = "Wind Speed (m/s)")) +
  facet_wrap( ~ pollutant)

c. Does your visualization support the belief of Beijing residents?

4. Effect of Government Plan

Beijing experienced one of the worst air quality crises in Chinese history in January 2013, with the entire city covered in a dense gray haze visible from space. This experience led the Chinese government to immediately draft an air pollution control plan that was implemented effective September 2013. We will build a visualization similar to the one below, where the line graphs for each year are displayed on top of each other, but with the year 2013 highlighted:

a. To make our graph a little easier to build, let’s aggregate our data by day. Make another data frame that summarizes each pollutant/day pair by the mean of each pollutant across all of Beijing. Include the day of year attribute from earlier in your summary dataframe. Name your dataframe beijing_by_day.

Hint: Your new dataframe should have at least the following column names:

"year", "month", "day", "pollutant", "average_concentration", "day_of_year"

beijing_by_day <- beijing_long %>%
  group_by(year, month, day, pollutant) %>%
  summarise(average_concentration = mean(na.omit(concentration)),
            day_of_year = first(day_of_year))
## `summarise()` has grouped output by 'year', 'month', 'day'. You can override using the `.groups` argument.

b. Because the data is so erratic, it helps to visualize the trends by using some type of rolling average. Use rollmean, rollmax, rollmedian, or rollsum from the zoo package to do some type of smoothing and make a new data frame with the smoothed data. Fill in the starter code below with your choice of rolling average and an appropriate smoothing window.

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
beijing_smoothed <- beijing_by_day %>%
  arrange(day_of_year) %>%
  group_by(pollutant, year) %>%
  summarise(
    # below line is one that needs to be filled in
    rolling_average = rollmedian(average_concentration, 7,
                                 na.pad = TRUE),
    day_of_year = day_of_year
  )
## `summarise()` has grouped output by 'pollutant', 'year'. You can override using the `.groups` argument.
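To see what rollmedian is doing above, here is a toy example (assuming the zoo package, as loaded earlier):

```r
library(zoo)

# Centered rolling median of width 3; na.pad = TRUE pads both ends with NA
# so the result is the same length as the input.
rollmedian(c(1, 9, 2, 8, 3, 7), 3, na.pad = TRUE)
# [1] NA  2  8  3  7 NA
```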

c. Create the plot mentioned above with your smoothed data.

Hint: Use the “day of year” column defined earlier.

Bonus: Format the x-axis ticks to display an appropriate date format.

ggplot(
  beijing_smoothed,
  aes(
    x = day_of_year,
    y = rolling_average,
    color = factor(year),
    alpha = factor(year)
  )
) +
  geom_line() +
  facet_wrap( ~ pollutant) +
  xlab("Day of Year") +
  ylab("Log of 7-day rolling median for pollutant daily mean") +
  scale_y_continuous(trans = "log10") +
  scale_alpha_manual(values = c(1, 0.15, 0.15, 0.15, 0.15)) +
  scale_x_continuous(breaks = c(1, 91, 182, 274), labels=c("Jan", "Apr", "Jul", "Oct"))
## Warning: Removed 30 row(s) containing missing values (geom_path).

d. What visual evidence (if any) is there that the plan in 2013 was effective? Justify your choice of smoothing function and window.

5. Animate the Data

a. We will now animate the data using ggmap and gganimate. Pick a map from the ./maps folder. Justify your map choice. You can see what a map looks like with code similar to the chunk below:

library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
load(file = "./maps/terrain_lines_bw.RData") # sets map variable
ggmap(map)

b. For simplicity, choose a single pollutant you would like to visualize. Set the single_frame variable below to a filtered version of the beijing data frame with a fixed year, month, day, hour, and pollutant.

single_frame <- beijing_long %>%
    filter(year==2015, month==7, day==2, hour==3, pollutant=="PM2.5")

Map justification: the terrain-lines map creates less clutter between the roads and the Voronoi cell borders than the alternative maps.

c. Use ggmap to make a figure visualizing the pollutant and hour you chose at the stations. We will do this in steps:

i. Use geom_segment to draw arrows at the locations of the given stations with the given starter code. For the aes of geom_segment, start each arrow at the station’s coordinates (long, lat) and offset the endpoint by the wind vector (direction from wd, length scaled by WSPM):

map_step_1 <- ggmap(map, extent="device", legend="bottomright") + 
  geom_segment(
    data=single_frame, 
    aes(x=long, 
        y=lat, 
        xend=long + cos(wind_direction_map[wd] * pi/180) * WSPM/50, 
        yend=lat + sin(wind_direction_map[wd] * pi/180) * WSPM/50,
        color=concentration),
    arrow=arrow(length=unit(0.2, "cm"))) +
  scale_color_viridis_c()

ii. Use geom_voronoi and stat_voronoi to draw Voronoi cells on the map, where seed points are station locations. The cells do not need to cover the entire map.

library(ggvoronoi)

map_step_2 <- map_step_1 +
    geom_voronoi(
        data=single_frame, 
        aes(x=long, y=lat, fill=TEMP), 
        alpha=0.1) +
    stat_voronoi(
        data=single_frame,
        aes(x=long, y=lat),
        geom = "path") +
  scale_fill_gradient(low="blue", high="red")

d. Use gganimate to animate the data for a week. Reuse your code above for each frame. Make sure you set the title to reflect the current time being shown.

library(gganimate)

one_week <- beijing_long %>% 
  filter(year == 2015, month == 5, day %in% 1:7, pollutant == "PM2.5")

our_animation <- ggmap(map, extent="device", legend="bottomright") + 
  geom_segment(
    data=one_week, 
    aes(x=long, 
        y=lat, 
        xend=long + cos(wind_direction_map[wd] * pi/180) * WSPM/50, 
        yend=lat + sin(wind_direction_map[wd] * pi/180) * WSPM/50,
        color=concentration),
    arrow=arrow(length=unit(0.2, "cm"))) +
    geom_voronoi(
        data=one_week, 
        aes(x=long, y=lat, fill=TEMP), 
        alpha=0.1) +
    stat_voronoi(
        data=one_week,
        aes(x=long, y=lat),
        geom = "path") +
  scale_fill_gradient(low="blue", high="red") + 
  scale_color_viridis_c() +
  transition_time(as_datetime) +
  labs(x="Longitude", 
       y="Latitude", 
       title = "{frame_time}", 
       fill="Temperature (C)", 
       color="Concentration (μg/m^3)")
animate(our_animation, duration = 5, fps = 6, width = 1000, height = 1000, renderer = gifski_renderer())

anim_save("output.gif")

e. Were the Voronoi cells helpful for this visualization? Why or why not?